Marwah Alian, Ghazi Al-Naymat, Banda Ramadan, Using transliteration with entity resolution for Arabic datasets, 2017 IEEE/ACS 14th International Conference on Computer Systems and Applications (AICCSA), pp. 593-597, 2017.
Abstract
Entity resolution (ER) is the process of identifying records that refer to the same real-world entity. It is used to link records across datasets and to match query records in real time against existing datasets. Indexing is a major step in the ER process that reduces the search space. Most existing indexing techniques used in the ER process are designed to work with English datasets and may not be suitable for other languages, such as Arabic. This paper proposes an enhancement that allows indexing techniques designed for English datasets to be used with Arabic: transliteration is applied to Arabic strings before the indexing step of the ER process. The proposed approach is evaluated experimentally and compared with using word stems as blocking keys in the indexing step. The results show better matching accuracy when transliteration is used than when word stems are used.
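The idea of transliterating before indexing can be illustrated with a minimal sketch. It is not the authors' implementation: the character map `TRANSLIT` is a simplified, hypothetical transliteration table, and American Soundex is assumed here only as one example of an indexing technique designed for English, since the abstract does not name the techniques used. The function names `transliterate`, `soundex`, and `block_records` are likewise illustrative.

```python
from collections import defaultdict

# Hypothetical, simplified Arabic-to-Latin map (an assumption for this
# sketch; not the transliteration scheme used in the paper).
TRANSLIT = {
    "ا": "a", "أ": "a", "إ": "i", "آ": "a", "ى": "a", "ء": "",
    "ب": "b", "ت": "t", "ث": "th", "ج": "j", "ح": "h", "خ": "kh",
    "د": "d", "ذ": "dh", "ر": "r", "ز": "z", "س": "s", "ش": "sh",
    "ص": "s", "ض": "d", "ط": "t", "ظ": "z", "ع": "a", "غ": "gh",
    "ف": "f", "ق": "q", "ك": "k", "ل": "l", "م": "m", "ن": "n",
    "ه": "h", "ة": "h", "و": "w", "ؤ": "w", "ي": "y", "ئ": "y",
}

# Standard American Soundex digit groups.
CODES = {}
for digit, group in [("1", "BFPV"), ("2", "CGJKQSXZ"), ("3", "DT"),
                     ("4", "L"), ("5", "MN"), ("6", "R")]:
    for letter in group:
        CODES[letter] = digit


def transliterate(text):
    """Map Arabic letters to Latin; keep ASCII, drop anything unmapped."""
    return "".join(TRANSLIT.get(ch, ch if ch.isascii() else "") for ch in text)


def soundex(name):
    """American Soundex: first letter plus three digits, zero-padded."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":          # H and W do not separate equal codes
            continue
        digit = CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        prev = digit           # vowels reset prev, so repeats are kept
    return (code + "000")[:4]


def block_records(names):
    """Group Arabic strings by the Soundex key of their transliteration."""
    blocks = defaultdict(list)
    for name in names:
        blocks[soundex(transliterate(name))].append(name)
    return dict(blocks)


blocks = block_records(["أحمد", "احمد", "محمد", "محمود"])
for key, members in blocks.items():
    print(key, members)
```

In this sketch the two spellings أحمد and احمد (with and without the hamza) fall into the same block, so only records within that block would need to be compared in the later matching step. Whether such a key groups records well in practice depends on the transliteration scheme and the indexing technique chosen.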